3.1.1 Logistic regression versus linear regression
Logistic regression models a binary outcome variable using the Bernoulli distribution, whereas linear regression uses the Gaussian (“normal”) distribution to model the outcome variable, conditional on the covariates.
Exercise 3.4 Recall: what kinds of outcomes is linear regression used for?
Solution 3.4. Linear regression is typically used for numerical outcomes that aren’t event counts or waiting times for an event.
Examples of outcomes that are often analyzed using linear regression include:
weight
height
income
prices
3.2 Risk Estimation and Prediction
In Epi 203, you have already seen methods for modeling binary outcomes using one covariate that is also binary (such as exposure/non-exposure). In this section, we review one-covariate analyses, with a special focus on risk ratios and odds ratios, which are important concepts for interpreting logistic regression.
Example 3.1 (Oral Contraceptive Use and Heart Attack)
Research question: how does oral contraceptive (OC) use affect the risk of myocardial infarction (MI; a.k.a. heart attack)?
This was an issue when oral contraceptives were first developed, because the original formulations used higher concentrations of hormones. Modern OCs don’t have this issue.
Table 3.1 contains simulated data for an imaginary follow-up (a.k.a. prospective) study in which two groups are identified, one using OCs and another not using OCs, and both groups are tracked for three years to determine how many in each group have MIs.
Exercise 3.6 What does the term “controls” mean in the context of study design?
Solution 3.6.
Definition 3.2 (Two meanings of “controls”) Depending on context, “controls” can mean either:
individuals who don’t experience an exposure of interest,
or individuals who don’t experience an outcome of interest.
Exercise 3.7 What types of studies do the two definitions of controls correspond to?
Solution 3.7.
Definition 3.3 (cases and controls in retrospective studies) In retrospective case-control studies, participants who experience the outcome of interest are called cases, while participants who don’t experience that outcome are called controls.
Definition 3.4 (treatment groups and control groups in prospective studies) In prospective studies, the group of participants who experience the treatment or exposure of interest is called the treatment group, while the participants who receive the baseline or comparison treatment (for example, clinical trial participants who receive a placebo or a standard-of-care treatment rather than an experimental treatment) are called controls.
3.3 Comparing probabilities
3.3.1 Risk differences
The simplest comparison of two probabilities, \(\pi_1\), and \(\pi_2\), is the difference of their values:
Definition 3.5 (Risk difference) The risk difference of two probabilities, \(\pi_1\), and \(\pi_2\), is the difference of their values: \[\delta(\pi_1,\pi_2) \stackrel{\text{def}}{=}\pi_1 - \pi_2\]
Example 3.2 (Difference in MI risk) In Example 3.1, the maximum likelihood estimate of the difference in MI risk between OC users and OC non-users is:
Exercise 3.11 How can we interpret the preceding relative risk estimate in prose?
Solution 3.11.
We might summarize this result by saying that “the estimated probability of MI among OC users was 3.714286 times as high as, or 2.714286 times higher than, the estimated probability among OC non-users.”
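As a quick numerical check, we can reproduce this ratio in base R. The sketch below takes the OC-user risk estimate of 0.0026 from Example 3.4; the non-user risk of 0.0007 is implied by the ratio quoted above, not stated directly in this excerpt.

```r
# Estimated MI risks (0.0007 for non-users is implied by the quoted ratio)
pi_oc     <- 0.0026  # Pr(MI | OC use)
pi_non_oc <- 0.0007  # Pr(MI | no OC use)

risk_ratio <- pi_oc / pi_non_oc
round(risk_ratio, 6)      # 3.714286: "times as high as"
round(risk_ratio - 1, 6)  # 2.714286: "times higher than"
```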
3.3.3 Relative risk differences
Definition 3.7 (Relative risk difference)
Sometimes, we divide the risk difference by the comparison probability; the result is called the relative risk difference:
3.3.4 Changing reference groups in risk comparisons
Risk differences, risk ratios, and relative risk differences are defined by two probabilities, plus a choice of which probability is the baseline or reference probability (i.e., which probability is the subtrahend of the risk difference or the denominator of the risk ratio).
In logistic regression, we will make use of a mathematically-convenient transformation of probability, called odds.
Definition 3.8 (Odds) The odds of an outcome \(A\), which we will represent using \(\omega\) (“omega”), is the probability that the outcome occurs, divided by the probability that it doesn’t occur:
Example 3.4 (Computing odds from probabilities) In Exercise 3.5, we estimated that the probability of MI, given OC use, is \(\pi(OC) \stackrel{\text{def}}{=}\Pr(MI|OC) = 0.0026\). If this estimate is correct, then the odds of MI, given OC use, is:
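We can carry out this conversion in R; a minimal sketch using the risk estimate above:

```r
odds <- function(pi) pi / (1 - pi)

pi_oc <- 0.0026          # estimated Pr(MI | OC), from above
omega_oc <- odds(pi_oc)  # odds of MI, given OC use
omega_oc                 # ~0.002607: nearly equal to pi_oc, since MI is rare
```

Because the event is rare, the odds and the probability are nearly identical here.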
Theorem 3.3 (One-sample MLE for odds) Let \(X_1, ..., X_n\) be a set of \(n\) iid Bernoulli trials, and let \(X = \sum_{i=1}^nX_i\) be their sum.
Then the maximum likelihood estimate of the odds of \(X=1\), \(\omega\), is:
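The MLE works out to \(\hat\omega = \hat\pi/(1-\hat\pi) = x/(n-x)\) (the standard plug-in result; the theorem's displayed formula is not reproduced in this excerpt). A numeric check with hypothetical counts:

```r
# Hypothetical counts: x events out of n trials
x <- 13
n <- 5000
pi_hat <- x / n

# The plug-in odds estimate pi_hat / (1 - pi_hat) simplifies to x / (n - x):
stopifnot(all.equal(pi_hat / (1 - pi_hat), x / (n - x)))
x / (n - x)
```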
Lemma 3.1 (Odds of a non-event) If \(\pi\) is the probability of event \(A\) and \(\omega\) is the corresponding odds of \(A\), then the odds of \(\neg A\) are:
\[
\begin{aligned}
\omega(\neg A) &= \frac{1-\pi}{\pi}
\\&= \pi^{-1}-1
\end{aligned}
\]
Proof. Left to the reader.
3.4.2 The inverse odds function
Definition 3.10 (inverse odds function) The inverse odds function,
\[\text{invodds}\left\{\omega\right\} \stackrel{\text{def}}{=}\frac{\omega}{1 + \omega}\] converts odds into their corresponding probabilities (Figure 3.2).
The inverse-odds function takes an odds as input and produces a probability as output. Its domain of inputs is \([0,\infty)\) and its range of outputs is \([0,1]\).
I haven’t seen anyone give the inverse-odds function a concise name; maybe \(\text{prob}()\) or \(\text{probs}()\) or \(\text{risk}()\)?
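Whatever we call it, the inverse-odds function is easy to verify numerically; a minimal sketch:

```r
odds    <- function(pi) pi / (1 - pi)
invodds <- function(omega) omega / (1 + omega)

p <- seq(0.1, 0.9, by = 0.1)
all.equal(invodds(odds(p)), p)  # TRUE: invodds() undoes odds()
invodds(c(0, 1, 3))             # 0, 0.5, 0.75: outputs stay within [0, 1)
```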
Lemma 3.2 If \(\pi\) is the probability of an event and \(\omega\) is the corresponding odds of that event, then:
This formula is derived as follows: Recall that odds are defined as \(\omega= \frac{\Pr(A)}{\Pr(\neg A)}\), where \(\Pr(\neg A) = 1 - \Pr(A)\). Solving for \(\Pr(A)\) in terms of \(\omega\), we get:
Exercise 3.20 For Table 3.2, show that \(\hat\theta(Exposed, Unexposed) = \hat\theta(Event, \neg Event)\).
Conditional odds ratios have the same reversibility property:
Theorem 3.8 (Conditional odds ratios are reversible) For any three events \(A\), \(B\), \(C\):
\[\theta(A|B,C) = \theta(B|A,C)\]
Proof. Apply the same steps as for Theorem 3.7, inserting \(C\) into the conditions (RHS of \(|\)) of every expression.
Odds Ratios vs Probability (Risk) Ratios
When the outcome is rare (i.e., its probability is small) for both groups being compared in an odds ratio, the odds of the outcome will be similar to the probability of the outcome, and thus the risk ratio will be similar to the odds ratio.
Case 1: rare events
For rare events, odds ratios and probability (a.k.a. risk, a.k.a. prevalence) ratios will be close:
Example 3.8 In Example 3.1, the outcome is rare for both OC and non-OC participants, so the odds for both groups are similar to the corresponding probabilities, and the odds ratio is similar to the risk ratio.
Case 2: frequent events
\[\pi_1 = .4\]
\[\pi_2 = .5\]
For more frequently-occurring outcomes, this won’t be the case:
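We can see both cases side by side in R (a sketch; the rare-event risks are the estimates from Example 3.1):

```r
odds <- function(pi) pi / (1 - pi)

# Case 1: rare outcome -- the odds ratio approximates the risk ratio
p1 <- 0.0026; p2 <- 0.0007
c(RR = p1 / p2, OR = odds(p1) / odds(p2))  # 3.714 vs 3.721: very close

# Case 2: frequent outcome (pi_1 = .4, pi_2 = .5) -- they differ noticeably
c(RR = 0.4 / 0.5, OR = odds(0.4) / odds(0.5))  # 0.8 vs about 0.67
```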
Table 3.1 simulates a follow-up study in which two populations were followed and the number of MIs was observed. The risks are \(P(MI|OC)\) and \(P(MI|\neg OC)\), and we can estimate these risks from the data.
But suppose we had a case-control study in which we had 100 women with MI and selected a comparison group of 100 women without MI (matched as groups on age, etc.). Then MI status is not random, so we cannot estimate \(P(MI|OC)\), and we cannot compute the risk ratio. However, the odds ratio can still be computed.
The disease odds ratio is the odds of the disease in the exposed group divided by the odds of the disease in the unexposed group; in a case-control study, we cannot validly compute and use these separate parts.
We can still validly compute and use the exposure odds ratio, which is the odds for exposure in the disease group divided by the odds for exposure in the non-diseased group (because exposure can be treated as random):
And these two odds ratios, \(\hat\theta(MI|OC)\) and \(\hat\theta(OC|MI)\), are mathematically equivalent, as we saw in Section 3.4.3.2:
\[\hat\theta(MI|OC) = \hat\theta(OC|MI)\]
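A numerical check of this equivalence, using a hypothetical 2×2 table whose counts are consistent with the risk estimates above (13/5000 and 7/10000):

```r
# Hypothetical 2x2 counts:
#          MI     no MI
# OC       13      4987
# no OC     7      9993
a <- 13; b <- 4987; c_ <- 7; d <- 9993

or_disease  <- (a / b) / (c_ / d)  # odds of MI, comparing OC to no OC
or_exposure <- (a / c_) / (b / d)  # odds of OC use, comparing MI to no MI
all.equal(or_disease, or_exposure) # TRUE: both equal (a * d) / (b * c)
```

Algebraically, both ratios reduce to the cross-product ratio \(ad/bc\), which is why the case-control design still identifies the odds ratio.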
Exercise 3.21 Calculate the odds ratio of MI with respect to OC use, assuming that Table 3.1 comes from a case-control study. Confirm that the result is the same as in Example 3.6.
Definition 3.13 (expit, logistic, inverse-logit) The expit function (Figure 3.4) of a log-odds \(\eta\), also known as the inverse-logit function or logistic function, is the inverse-odds of the exponential of \(\eta\):
In Example 3.1, we estimated the risk and the odds of MI for two groups, defined by oral contraceptive use.
If the predictor is quantitative (dose) or there is more than one predictor, the task becomes more difficult.
In this case, we will use logistic regression, which is a generalization of the linear regression models you have been using that can account for a binary response instead of a continuous one.
3.6.1 Independent binary outcomes - general
For a data set of mutually independent binary outcomes \(\tilde{y}= (y_1, ..., y_n)\):
Exercise 3.23 Write the likelihood of the data set.
The slope is steepest at \(\pi = 0.5\), i.e., at \(\eta = 0\), which for a unipredictor model occurs at \(x = -\alpha/\beta\). The slope goes to 0 as \(x\) goes to \(-\infty\) or \(+\infty\) (compare with Figure 3.4).
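This slope property can be checked numerically. The sketch below uses the standard inverse-logit; the identity being verified is \(\frac{d}{d\eta}\text{expit}(\eta) = \pi(1-\pi)\), which is maximized at \(\pi = 0.5\):

```r
expit <- function(eta) 1 / (1 + exp(-eta))

# Central-difference numerical derivative of expit at a point eta:
d_expit <- function(eta, h = 1e-6) (expit(eta + h) - expit(eta - h)) / (2 * h)

pi0 <- expit(0.7)
all.equal(d_expit(0.7), pi0 * (1 - pi0), tolerance = 1e-6)  # slope = pi(1 - pi)
d_expit(0)  # 0.25, the maximum slope, attained at eta = 0 (pi = 0.5)
```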
Note
In order to interpret \(\beta_j\), differentiate or difference \({\color{red}\eta(\tilde{x})}\) with respect to \({\color{red}x_j}\):
In order to find the MLE \(\hat{\tilde{\beta}}\), differentiate the log-likelihood function \({\color{blue}\ell(\tilde{\beta})}\) with respect to \({\color{blue}\tilde{\beta}}\):
The difference is due to the binomial coefficient \(\binom{n}{x}\), which isn’t included in the individual-observations (Bernoulli) version of the model.
3.7 Fitting logistic regression models with maximum likelihood estimation
Table 3.9: Fitted logistic regression model for beetles data

| Parameter   | Log-Odds | SE   | 95% CI           | z      | p      |
|-------------|----------|------|------------------|--------|--------|
| (Intercept) | -60.72   | 5.18 | (-71.44, -51.08) | -11.72 | < .001 |
| dose        | 34.27    | 2.91 | (28.85, 40.30)   | 11.77  | < .001 |
Solution 3.18.
Show R code
odds_inv <- function(omega) (1 + omega^-1)^-1

lik_beetles0 <- function(beta_0, beta_1) {
  beetles |>
    mutate(
      eta = beta_0 + beta_1 * dose,
      omega = exp(eta),
      pi = odds_inv(omega),
      Lik = pi^died * (1 - pi)^survived
    ) |>
    pull(Lik) |>
    prod()
}
lik_beetles <- Vectorize(lik_beetles0)

ranges <- beetles_glm |> confint.default(level = 0.8)
n_points <- 250
beta_0s <- seq(ranges["(Intercept)", 1], ranges["(Intercept)", 2], length.out = n_points)
beta_1s <- seq(ranges["dose", 1], ranges["dose", 2], length.out = n_points)
names(beta_0s) <- round(beta_0s, 2)
names(beta_1s) <- round(beta_1s, 2)

if (run_graph) {
  lik_mat_beetles <- outer(beta_0s, beta_1s, lik_beetles)
  plotly::plot_ly(
    type = "surface",
    x = ~beta_0s,
    y = ~beta_1s,
    z = ~t(lik_mat_beetles)
  )
  # see https://stackoverflow.com/questions/69472185/correct-use-of-coordinates-to-plot-surface-data-with-plotly
  # for explanation of why transpose `t()` is needed
}
Figure 3.10: Likelihood of beetles data
3.7.3 Log-likelihood function
Exercise 3.27 Find the log-likelihood function for the general logistic regression model.
Exercise 3.28 Compute and graph the log-likelihood for the beetles data.
Solution 3.20.
Show R code
odds_inv <- function(omega) (1 + omega^-1)^-1

llik_beetles0 <- function(beta_0, beta_1) {
  beetles |>
    mutate(
      eta = beta_0 + beta_1 * dose,
      omega = exp(eta),
      pi = odds_inv(omega), # needed for the next line:
      llik = died * eta + n * log(1 - pi)
    ) |>
    pull(llik) |>
    sum()
}
llik_beetles <- Vectorize(llik_beetles0)

# to check that we implemented it correctly:
# ests = coef(beetles_glm_ungrouped)
# logLik(beetles_glm_ungrouped)
# llik_beetles(ests[1], ests[2])

if (run_graph) {
  llik_mat_beetles <- outer(beta_0s, beta_1s, llik_beetles)
  plotly::plot_ly(
    type = "surface",
    x = ~beta_0s,
    y = ~beta_1s,
    z = ~t(llik_mat_beetles)
  )
}
Figure 3.11: log-likelihood of beetles data
Let’s center dose:
Show R code
beetles_glm_grouped_centered <- beetles |>
  glm(
    formula = cbind(died, survived) ~ dose_c,
    family = "binomial"
  )
beetles_glm_ungrouped_centered <- beetles_long |>
  mutate(died = outcome) |>
  glm(
    formula = died ~ dose_c,
    family = "binomial"
  )
equatiomatic::extract_eq(beetles_glm_ungrouped_centered)
Table 3.10: Fitted logistic regression model for beetles data, with dose centered

| Parameter   | Log-Odds | SE   | 95% CI         | z     | p      |
|-------------|----------|------|----------------|-------|--------|
| (Intercept) | 0.74     | 0.14 | (0.48, 1.02)   | 5.40  | < .001 |
| dose_c      | 34.27    | 2.91 | (28.85, 40.30) | 11.77 | < .001 |
Show R code
odds_inv <- function(omega) (1 + omega^-1)^-1

lik_beetles0 <- function(beta_0, beta_1) {
  beetles |>
    mutate(
      eta = beta_0 + beta_1 * dose_c,
      omega = exp(eta),
      pi = odds_inv(omega),
      Lik = (pi^died) * (1 - pi)^(survived)
    ) |>
    pull(Lik) |>
    prod()
}
lik_beetles <- Vectorize(lik_beetles0)

ranges <- beetles_glm_grouped_centered |> confint.default(level = .8)
n_points <- 100
beta_0s <- seq(ranges["(Intercept)", 1], ranges["(Intercept)", 2], length.out = n_points)
beta_1s <- seq(ranges["dose_c", 1], ranges["dose_c", 2], length.out = n_points)
names(beta_0s) <- round(beta_0s, 2)
names(beta_1s) <- round(beta_1s, 2)

if (run_graph) {
  lik_mat_beetles <- outer(beta_0s, beta_1s, lik_beetles)
  plotly::plot_ly(
    type = "surface",
    x = ~beta_0s,
    y = ~beta_1s,
    z = ~t(lik_mat_beetles)
  )
}
Figure 3.12: Likelihood of beetles data (centered model)
Show R code
odds_inv <- function(omega) (1 + omega^-1)^-1

llik_beetles0 <- function(beta_0, beta_1) {
  beetles |>
    mutate(
      eta = beta_0 + beta_1 * dose_c,
      omega = exp(eta),
      pi = odds_inv(omega),
      llik = died * eta + n * log(1 - pi)
    ) |>
    pull(llik) |>
    sum()
}
llik_beetles <- Vectorize(llik_beetles0)

if (run_graph) {
  llik_mat_beetles <- outer(beta_0s, beta_1s, llik_beetles)
  plotly::plot_ly(
    type = "surface",
    x = ~beta_0s,
    y = ~beta_1s,
    z = ~t(llik_mat_beetles)
  )
}
Figure 3.13: log-likelihood of beetles data (centered model)
Let’s first combine \({\color{green}\frac{\partial \omega}{\partial \eta}} {\color{blue}\frac{\partial \pi}{\partial \omega}} = {\color{orange}\frac{\partial \pi}{\partial \eta}}\) to get two more results:
Example 3.9 In our model for the beetles data, we only have an intercept plus one covariate, gas concentration (\(c\)): \[\tilde{x}= (1, c)\]
If \(c_i\) is the gas concentration for the beetle in observation \(i\), and \(\tilde{c} = (c_1, c_2, ...c_n)\), then the score equation \(\tilde{\ell'} = 0\) means that for the MLE \(\hat{\tilde{\beta}}\):
the sum of the errors (aka deviations) equals 0:
\[\sum_{i=1}^n\varepsilon_i = 0\]
the weighted sum of the error times the gas concentrations equals 0:
\[\sum_{i=1}^nc_i \varepsilon_i = 0 \]
Exercise 3.29 Implement and graph the score function for the beetles data
Solution 3.21.
Show R code
odds_inv <- function(omega) (1 + omega^-1)^-1

score_fn_beetles_beta0_0 <- function(beta_0, beta_1) {
  beetles |>
    mutate(
      eta = beta_0 + beta_1 * dose_c,
      omega = exp(eta),
      pi = odds_inv(omega),
      mu = pi * n,
      epsilon = died - mu,
      score = epsilon
    ) |>
    pull(score) |>
    sum()
}
score_fn_beetles_beta_0 <- Vectorize(score_fn_beetles_beta0_0)

score_fn_beetles_beta1_0 <- function(beta_0, beta_1) {
  beetles |>
    mutate(
      eta = beta_0 + beta_1 * dose_c,
      omega = exp(eta),
      pi = odds_inv(omega),
      mu = pi * n,
      epsilon = died - mu,
      score = dose_c * epsilon
    ) |>
    pull(score) |>
    sum()
}
score_fn_beetles_beta_1 <- Vectorize(score_fn_beetles_beta1_0)

if (run_graph) {
  scores_beetles_beta_0 <- outer(beta_0s, beta_1s, score_fn_beetles_beta_0)
  scores_beetles_beta_1 <- outer(beta_0s, beta_1s, score_fn_beetles_beta_1)
  plotly::plot_ly(x = ~beta_0s, y = ~beta_1s) |>
    plotly::add_markers(
      type = "scatter",
      x = coef(beetles_glm_grouped_centered)["(Intercept)"],
      y = coef(beetles_glm_grouped_centered)["dose_c"],
      z = 0,
      marker = list(color = "black"),
      name = "MLE"
    ) |>
    plotly::add_surface(
      z = ~t(scores_beetles_beta_1),
      name = "score_beta_1",
      colorscale = list(c(0, 1), c("red", "green")),
      showscale = FALSE,
      contours = list(
        z = list(show = TRUE, start = -1, end = 1, size = .1)
      ),
      opacity = 0.75
    ) |>
    plotly::add_surface(
      z = ~t(scores_beetles_beta_0),
      colorscale = list(c(0, 1), c("yellow", "blue")),
      showscale = FALSE,
      contours = list(
        z = list(show = TRUE, start = -14, end = 14, size = 2)
      ),
      name = "score_beta_0"
    ) |>
    plotly::layout(legend = list(orientation = "h"))
}
Figure 3.14: score functions of beetles data (centered model), by parameter (\(\beta_0\) and \(\beta_1\)), with MLE marked in black.
where \(\mathbf{D} \stackrel{\text{def}}{=}\text{diag}(\text{Var}\left(Y_i|X_i=x_i\right))\) is the diagonal matrix whose \(i^{th}\) diagonal element is \(\text{Var}\left(Y_i|X_i=x_i\right)\).
We make an iterative series of guesses, and each guess helps us make the next guess better (i.e., higher log-likelihood). You can see some information about this process like so:
After each iteration of the fitting procedure, the deviance \(2(\ell_{\text{full}} - \ell(\hat\beta))\) is printed. You can see that the algorithm took 5 iterations to converge to a solution where the likelihood was no longer changing much.
Table 3.11 and Table 3.12 show the fitted model and the covariance matrix of the estimates, respectively.
Table 3.11: Fitted model for beetles data
Show R code
beetles_glm_ungrouped |> summary()
#> 
#> Call:
#> glm(formula = outcome ~ dose, family = "binomial", data = beetles_long, 
#>     control = glm.control(trace = TRUE))
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   -60.72       5.18   -11.7   <2e-16 ***
#> dose           34.27       2.91    11.8   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 645.44  on 480  degrees of freedom
#> Residual deviance: 372.47  on 479  degrees of freedom
#> AIC: 376.5
#> 
#> Number of Fisher Scoring iterations: 5
Table 3.12: Parameter estimate covariance matrix for beetles data
“The Western Collaborative Group Study (WCGS) was a large epidemiological study designed to investigate the association between the ‘type A’ behavior pattern and coronary heart disease (CHD).”
Exercise 3.32 What is “type A” behavior?
Solution 3.24. From Wikipedia, “Type A and Type B personality theory”:
“The hypothesis describes Type A individuals as outgoing, ambitious, rigidly organized, highly status-conscious, impatient, anxious, proactive, and concerned with time management….
The hypothesis describes Type B individuals as a contrast to those of Type A. Type B personalities, by definition, are noted to live at lower stress levels. They typically work steadily and may enjoy achievement, although they have a greater tendency to disregard physical or mental stress when they do not achieve.”
Study design
from ?faraway::wcgs:
3154 healthy young men aged 39-59 from the San Francisco area were assessed for their personality type. All were free from coronary heart disease at the start of the research. Eight and a half years later, change in CHD status was recorded.
Details (from faraway::wcgs)
The WCGS began in 1960 with 3,524 male volunteers who were employed by 11 California companies. Subjects were 39 to 59 years old and free of heart disease as determined by electrocardiogram. After the initial screening, the study population dropped to 3,154 and the number of companies to 10 because of various exclusions. The cohort comprised both blue- and white-collar employees.
3.8.2 Baseline data collection
socio-demographic characteristics:
age
education
marital status
income
occupation
physical and physiological including:
height
weight
blood pressure
electrocardiogram
corneal arcus
biochemical measurements:
cholesterol and lipoprotein fractions;
medical and family history and use of medications;
behavioral data:
Type A interview,
smoking,
exercise
alcohol use.
Later surveys added data on:
anthropometry
triglycerides
Jenkins Activity Survey
caffeine use
Average follow-up continued for 8.5 years with repeat examinations.
3.8.3 Load the data
Here, I load the data:
Show R code
### load the data directly from a UCSF website:
library(haven)
url <- paste0(
  # I'm breaking up the url into two chunks for readability
  "https://regression.ucsf.edu/sites/g/files/",
  "tkssra6706/f/wysiwyg/home/data/wcgs.dta"
)
wcgs <- haven::read_dta(url)
Table 3.14: Baseline characteristics by CHD status at end of follow-up

| Characteristic | Overall, N = 3,154¹ | No, N = 2,897¹ | Yes, N = 257¹ | p-value |
|---|---|---|---|---|
| Age (years) | 45.0 (42.0, 50.0) | 45.0 (41.0, 50.0) | 49.0 (44.0, 53.0) | |
| Arcus Senilis | 941 (30%) | 839 (29%) | 102 (40%) | |
| — Missing | 2 | 0 | 2 | |
| Behavioral Pattern | | | | |
| — A1 | 264 (8.4%) | 234 (8.1%) | 30 (12%) | |
| — A2 | 1,325 (42%) | 1,177 (41%) | 148 (58%) | |
| — B3 | 1,216 (39%) | 1,155 (40%) | 61 (24%) | |
| — B4 | 349 (11%) | 331 (11%) | 18 (7.0%) | |
| Body Mass Index (kg/m²) | 24.39 (22.96, 25.84) | 24.39 (22.89, 25.84) | 24.82 (23.63, 26.50) | |
| Total Cholesterol | 223 (197, 253) | 221 (195, 250) | 245 (222, 276) | |
| — Missing | 12 | 12 | 0 | |
| Diastolic Blood Pressure | 80 (76, 86) | 80 (76, 86) | 84 (80, 90) | |
| Behavioral Pattern | | | | |
| — Type B | 1,565 (50%) | 1,486 (51%) | 79 (31%) | |
| — Type A | 1,589 (50%) | 1,411 (49%) | 178 (69%) | |
| Height (inches) | 70.00 (68.00, 72.00) | 70.00 (68.00, 72.00) | 70.00 (68.00, 71.00) | |
| Ln of Systolic Blood Pressure | 4.84 (4.79, 4.91) | 4.84 (4.77, 4.91) | 4.87 (4.82, 4.97) | |
| Ln of Weight | 5.14 (5.04, 5.20) | 5.13 (5.04, 5.20) | 5.16 (5.09, 5.22) | |
| Cigarettes per day | 0 (0, 20) | 0 (0, 20) | 20 (0, 30) | |
| Systolic Blood Pressure | 126 (120, 136) | 126 (118, 136) | 130 (124, 144) | |
| Current smoking | 1,502 (48%) | 1,343 (46%) | 159 (62%) | |
| Observation (follow up) time (days) | 2,942 (2,842, 3,037) | 2,952 (2,864, 3,048) | 1,666 (934, 2,400) | |
| Type of CHD Event | | | | |
| — None | 2,897 (92%) | 2,897 (100%) | 0 (0%) | |
| — infdeath | 135 (4.3%) | 0 (0%) | 135 (53%) | |
| — silent | 71 (2.3%) | 0 (0%) | 71 (28%) | |
| — angina | 51 (1.6%) | 0 (0%) | 51 (20%) | |
| Weight (lbs) | 170 (155, 182) | 169 (155, 182) | 175 (162, 185) | |
| Weight Category | | | | |
| — < 140 | 232 (7.4%) | 217 (7.5%) | 15 (5.8%) | |
| — 140-170 | 1,538 (49%) | 1,440 (50%) | 98 (38%) | |
| — 170-200 | 1,171 (37%) | 1,049 (36%) | 122 (47%) | |
| — > 200 | 213 (6.8%) | 191 (6.6%) | 22 (8.6%) | |
| RECODE of age (Age) | | | | |
| — 35-40 | 543 (17%) | 512 (18%) | 31 (12%) | |
| — 41-45 | 1,091 (35%) | 1,036 (36%) | 55 (21%) | |
| — 46-50 | 750 (24%) | 680 (23%) | 70 (27%) | |
| — 51-55 | 528 (17%) | 463 (16%) | 65 (25%) | |
| — 56-60 | 242 (7.7%) | 206 (7.1%) | 36 (14%) | |

¹ Median (Q1, Q3); n (%)
3.8.6 Data by age and personality type
For now, we will look at the interaction between age and personality type (dibpat). To make it easier to visualize the data, we summarize the event rates for each combination of age and personality type:
Show R code
library(dplyr)
odds <- function(pi) pi / (1 - pi)
chd_grouped_data <- wcgs |>
  summarize(
    .by = c(age, dibpat),
    n = sum(chd69 %in% c("Yes", "No")),
    x = sum(chd69 == "Yes")
  ) |>
  mutate(
    `n - x` = n - x,
    `p(chd)` = (x / n) |> labelled(label = "CHD Event by 1969"),
    `odds(chd)` = `p(chd)` / (1 - `p(chd)`),
    `logit(chd)` = log(`odds(chd)`)
  )
chd_grouped_data
Figure 3.15: CHD rates by age group, probability scale
Odds scale
Show R code
library(scales) # provides trans_new()
odds_inv <- function(omega) omega / (1 + omega)
trans_odds <- trans_new(
  name = "odds",
  transform = odds,
  inverse = odds_inv
)
chd_plot_odds <- chd_plot_probs +
  scale_y_continuous(
    trans = trans_odds, # this line changes the vertical spacing
    name = chd_plot_probs$labels$y,
    sec.axis = sec_axis(~odds(.), name = "odds(CHD Event by 1969)")
  )
print(chd_plot_odds)
Figure 3.16: CHD rates by age group, odds spacing
Log-odds (logit) scale
Show R code
logit <- function(pi) log(odds(pi))
expit <- function(eta) odds_inv(exp(eta))
trans_logit <- trans_new(
  name = "logit",
  transform = logit,
  inverse = expit
)
chd_plot_logit <- chd_plot_probs +
  scale_y_continuous(
    trans = trans_logit, # this line changes the vertical spacing
    name = chd_plot_probs$labels$y,
    breaks = c(seq(.01, .1, by = .01), .15, .2),
    minor_breaks = NULL,
    sec.axis = sec_axis(~logit(.), name = "log(odds(CHD Event by 1969))")
  )
print(chd_plot_logit)
Figure 3.17: CHD data (logit-scale)
3.8.8 Logistic regression models for CHD data
For the wgcs dataset, let’s consider a logistic regression model for the outcome of Coronary Heart Disease (\(Y\); chd in computer output):
\(Y = 1\) if an individual developed CHD by the end of the study;
\(Y = 0\) if they have not developed CHD by the end of the study.
Let’s include two covariates, plus their interaction:
\(A\): age at study enrollment (age, recorded in years)
\(P\): personality type (dibpat):
\(P = 1\) represents “Type A personality”,
\(P = 0\) represents “Type B personality”.
\(PA\): the interaction of personality type and age (type:dibpat)
We can graph our fitted models on each scale (probability, odds, log-odds).
probability scale
Show R code
curve_type_A <- function(x) { # nolint: object_name_linter
  chd_glm_contrasts |>
    predict(
      type = "response",
      newdata = tibble(age = x, dibpat = "Type A")
    )
}
curve_type_B <- function(x) { # nolint: object_name_linter
  chd_glm_contrasts |>
    predict(
      type = "response",
      newdata = tibble(age = x, dibpat = "Type B")
    )
}
chd_plot_probs_2 <- chd_plot_probs +
  geom_function(fun = curve_type_A, aes(col = "Type A")) +
  geom_function(fun = curve_type_B, aes(col = "Type B"))
print(chd_plot_probs_2)
Show R code
chd_plot_odds_2 <- chd_plot_odds +
  geom_function(fun = curve_type_A, aes(col = "Type A")) +
  geom_function(fun = curve_type_B, aes(col = "Type B"))
print(chd_plot_odds_2)
odds scale
log-odds (logit) scale
Show R code
chd_plot_logit_2 <- chd_plot_logit +
  geom_function(fun = curve_type_A, aes(col = "Type A")) +
  geom_function(fun = curve_type_B, aes(col = "Type B"))
print(chd_plot_logit_2)
Figure 3.18
3.8.10 Interpreting the model parameters
Exercise 3.33 Derive interpretations of \(\beta_0\), \(\beta_P\), \(\beta_A\), and \(\beta_{PA}\) on the odds and log-odds scales. State the interpretations concisely in math and in words.
\(\beta_{0}\) is the log of the odds (“log-odds”) of experiencing CHD for a 0 year-old person with a type B personality; that is,
\(\text{e}^{\beta_{0}}\) is the odds of experiencing CHD for a 0 year-old with a type B personality,
\[
\begin{aligned}
\text{exp}\left\{\beta_{0}\right\}
&= \frac{\Pr(Y= 1| P = 0, A = 0)}{1-\Pr(Y= 1| P = 0, A = 0)} \\
&= \frac
{\Pr(Y= 1| P = 0, A = 0)}
{\Pr(Y= 0| P = 0, A = 0)}
\end{aligned}
\]
\[\beta_A = \frac{\partial}{\partial a}\eta(P=0, A = a)\]
\(\beta_A\) is the slope of the log-odds of CHD with respect to age, for individuals with personality type B.
Alternatively:
\[
\begin{aligned}
\beta_{A}
&= \eta(P = 0, A = a + 1)- \eta(P = 0, A = a)
\end{aligned}
\]
That is, \(\beta_{A}\) is the difference in log-odds of experiencing CHD per one-year difference in age between two individuals with type B personalities.
\[
\begin{aligned}
\text{exp}\left\{\beta_{A}\right\} &= \text{exp}\left\{\eta(P = 0, A = a + 1)- \eta(P = 0, A = a)\right\}
\\
&= \frac{\text{exp}\left\{\eta(P = 0, A = a + 1)\right\}}{\text{exp}\left\{\eta(P = 0, A = a)\right\}}
\\
&= \frac{\omega(P = 0, A = a + 1)}{\omega(P = 0, A = a)}
\\
&= \frac
{\text{odds}(Y= 1|P=0, A = a + 1)}
{\text{odds}(Y= 1|P=0, A = a)}
\\
&= \theta(\Delta a = 1 | P = 0)
\end{aligned}
\]
That is, per one-year difference in age, the odds of experiencing CHD differ by a factor of \(\text{e}^{\beta_{A}}\) (the odds ratio) among individuals with type B personality.
\(\beta_{P}\) is the difference in log-odds of experiencing CHD for a 0 year-old person with type A personality compared to a 0 year-old person with type B personality; that is,
\(\text{e}^{\beta_{P}}\) is the ratio of the odds (aka “the odds ratio”) of experiencing CHD, for a 0-year old individual with type A personality vs a 0-year old individual with type B personality; that is,
\[
\text{exp}\left\{\beta_{P}\right\}
= \frac
{\text{odds}(Y= 1|P=1, A = 0)}
{\text{odds}(Y= 1|P=0, A = 0)}
\]
\[
\begin{aligned}
\frac{\partial}{\partial a}\eta(P={\color{red}1}, A = a) &= {\color{red}\beta_A + \beta_{PA}}
\\
\frac{\partial}{\partial a}\eta(P={\color{blue}0}, A = a) &= {\color{blue}\beta_A}
\end{aligned}
\]
Therefore:
\[
\begin{aligned}
\frac{\partial}{\partial a}\eta(P={\color{red}1}, A = a) - \frac{\partial}{\partial a}\eta(P={\color{blue}0}, A = a) &= {\color{red}\beta_A + \beta_{PA}} - {\color{blue}\beta_A}
\\
&= \beta_{PA}
\end{aligned}
\]
That is,
\[\beta_{PA} = \frac{\partial}{\partial a}\eta(P={\color{red}1}, A = a) - \frac{\partial}{\partial a}\eta(P={\color{blue}0}, A = a)\]
\(\beta_{PA}\) is the difference in the slopes of log-odds over age between participants with Type A personalities and participants with Type B personalities.
Accordingly, the odds ratio of experiencing CHD per one-year difference in age differs by a factor of \(\text{e}^{\beta_{PA}}\) for participants with type A personality compared to participants with type B personality; that is,
\[
\begin{aligned}
\theta(\Delta a = 1 | P = 1)
= \text{exp}\left\{\beta_{PA}\right\} \times \theta(\Delta a = 1 | P = 0)
\end{aligned}
\]
or equivalently:
\[
\text{exp}\left\{\beta_{PA}\right\} =
\frac
{\theta(\Delta a = 1 | P = 1)}
{\theta(\Delta a = 1 | P = 0)}
\]
3.8.11 General formula for odds ratios in logistic regression
Exercise 3.34 If I give you model 1, how would you get the coefficients of model 2?
3.9 Model comparisons for logistic models
3.9.1 Deviance test
We can compare the maximized log-likelihood of our model, \(\ell(\hat\beta; \mathbf x)\), versus the log-likelihood of the full model (aka saturated model aka maximal model), \(\ell_{\text{full}}\), which has one parameter per covariate pattern. With enough data, \(2(\ell_{\text{full}} - \ell(\hat\beta; \mathbf x)) \dot \sim \chi^2(N - p)\), where \(N\) is the number of distinct covariate patterns and \(p\) is the number of \(\beta\) parameters in our model. A significant p-value for this deviance statistic indicates that there’s some detectable pattern in the data that our model isn’t flexible enough to catch.
Caution
The deviance statistic needs to have a large amount of data for each covariate pattern for the \(\chi^2\) approximation to hold. A guideline from Dobson is that if there are \(q\) distinct covariate patterns \(x_1...,x_q\), with \(n_1,...,n_q\) observations per pattern, then the expected frequencies \(n_k \cdot \pi(x_k)\) should be at least 1 for every pattern \(k\in 1:q\).
If you have covariates measured on a continuous scale, you may not be able to use the deviance tests to assess goodness of fit.
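As a sketch of the mechanics (using simulated grouped data, since the beetles data itself isn't reproduced here; the dose range and coefficients loosely echo the beetles example), the deviance test can be run like this:

```r
# Simulated grouped binomial data with 8 distinct covariate patterns
set.seed(1)
dose <- seq(1.69, 1.88, length.out = 8)
n    <- rep(60, 8)
pi_true <- plogis(-60 + 34 * dose)
died <- rbinom(8, size = n, prob = pi_true)

fit <- glm(cbind(died, n - died) ~ dose, family = binomial)
dev <- deviance(fit)    # 2 * (loglik_saturated - loglik_model)
df  <- df.residual(fit) # 8 covariate patterns - 2 beta parameters = 6
pchisq(dev, df, lower.tail = FALSE)  # p-value for lack of fit
```

A small p-value here would indicate a pattern in the data that the model can't capture.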
3.9.2 Hosmer-Lemeshow test
If our covariate patterns produce groups that are too small, a reasonable solution is to make bigger groups by merging some of the covariate-pattern groups together.
Hosmer and Lemeshow (1980) proposed that we group the patterns by their predicted probabilities according to the model of interest. For example, you could group all of the observations with predicted probabilities of 10% or less together, then group the observations with 11%-20% probability together, and so on; \(g=10\) categories in all.
Then we can construct a statistic \[X^2 = \sum_{c=1}^g \frac{(o_c - e_c)^2}{e_c}\] where \(o_c\) is the number of events observed in group \(c\), and \(e_c\) is the number of events expected in group \(c\) (based on the sum of the fitted values \(\hat\pi_i\) for observations in group \(c\)).
If each group has enough observations in it, you can compare \(X^2\) to a \(\chi^2\) distribution; by simulation, the degrees of freedom has been found to be approximately \(g-2\).
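The statistic itself is a one-liner once the groups are formed; a minimal sketch with hypothetical observed and expected counts:

```r
# Hosmer-Lemeshow statistic from observed (o) and expected (e) event counts
# per group; the values here are hypothetical
o <- c(12, 18, 25, 30, 41)            # observed events in each of g groups
e <- c(14.2, 17.1, 24.8, 32.5, 38.9)  # expected events (sums of fitted pi_i)

X2 <- sum((o - e)^2 / e)
g  <- length(o)
pchisq(X2, df = g - 2, lower.tail = FALSE)  # simulation-based df = g - 2
```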
For our CHD model, this procedure would be:
Show R code
wcgs <- wcgs |>
  mutate(
    pred_probs_glm1 = chd_glm_contrasts |> fitted(),
    pred_prob_cats1 = pred_probs_glm1 |>
      cut(
        breaks = seq(0, 1, by = .1),
        include.lowest = TRUE
      )
  )
HL_table <- # nolint: object_name_linter
  wcgs |>
  summarize(
    .by = pred_prob_cats1,
    n = n(),
    o = sum(chd69 == "Yes"),
    e = sum(pred_probs_glm1)
  )
library(pander)
HL_table |> pander()
Our statistic is \(X^2 = 1.110287\); \(p(\chi^2(1) > 1.110287) = 0.29202\), which is our p-value for detecting a lack of goodness of fit.
Unfortunately that grouping plan left us with just three categories with any observations, so instead of grouping by 10% increments of predicted probability, typically analysts use deciles of the predicted probabilities:
Show R code
wcgs <- wcgs |>
  mutate(
    pred_probs_glm1 = chd_glm_contrasts |> fitted(),
    pred_prob_cats1 = pred_probs_glm1 |>
      cut(
        breaks = quantile(pred_probs_glm1, seq(0, 1, by = .1)),
        include.lowest = TRUE
      )
  )
HL_table <- # nolint: object_name_linter
  wcgs |>
  summarize(
    .by = pred_prob_cats1,
    n = n(),
    o = sum(chd69 == "Yes"),
    e = sum(pred_probs_glm1)
  )
HL_table |> pander()
Now we have more evenly split categories. The p-value is \(0.56042\), still not significant.
Graphically, we have compared:
Show R code
HL_plot <- # nolint: object_name_linter
  HL_table |>
  ggplot(aes(x = pred_prob_cats1)) +
  geom_line(aes(y = e, x = pred_prob_cats1, group = "Expected", col = "Expected")) +
  geom_point(aes(y = e, size = n, col = "Expected")) +
  geom_point(aes(y = o, size = n, col = "Observed")) +
  geom_line(aes(y = o, col = "Observed", group = "Observed")) +
  scale_size(range = c(1, 4)) +
  theme_bw() +
  ylab("number of CHD events") +
  theme(axis.text.x = element_text(angle = 45))
- BIC = \(-2 \ell(\hat\theta) + p \log(n)\) [lower is better]
- likelihood ratio [higher is better]
3.10 Residual-based diagnostics
3.10.1 Logistic regression residuals only work for grouped data
```{r}
library(haven)
url <- paste0(
  # I'm breaking up the url into two chunks for readability
  "https://regression.ucsf.edu/sites/g/files/",
  "tkssra6706/f/wysiwyg/home/data/wcgs.dta")
library(here) # provides the `here()` function
library(fs)   # provides the `path()` function
here::here() |> fs::path("Data/wcgs.rda") |> load()

chd_glm_contrasts <- wcgs |>
  glm(
    "data" = _,
    "formula" = chd69 == "Yes" ~ dibpat * age,
    "family" = binomial(link = "logit"))

library(ggfortify)
chd_glm_contrasts |> autoplot()
```
Figure 3.19: Residual diagnostics for WCGS model with individual-level observations
Residuals only work if there is more than one observation for most covariate patterns.
Here we will create the grouped-data version of our CHD model from the WCGS study:
We can graph these residuals \(e_k\) against the fitted values \(\hat\pi(x_k)\):
```{r}
odds <- function(pi) pi / (1 - pi)
logit <- function(pi) log(odds(pi))

wcgs_grouped <- wcgs_grouped |>
  mutate(
    fitted = chd_glm_contrasts_grouped |> fitted(),
    fitted_logit = fitted |> logit(),
    response_resids = chd_glm_contrasts_grouped |> resid(type = "response"))

wcgs_response_resid_plot <- wcgs_grouped |>
  ggplot(
    mapping = aes(
      x = fitted,
      y = response_resids)) +
  geom_point(aes(col = dibpat)) +
  geom_hline(yintercept = 0) +
  geom_smooth(
    se = TRUE,
    method.args = list(
      span = 2 / 3,
      degree = 1,
      family = "symmetric",
      iterations = 3),
    method = stats::loess)
```
Don't worry about these `geom_smooth()` options for now; I chose them to match `autoplot()` as closely as I can. `plot.glm` and `autoplot` use `stats::lowess` instead of `stats::loess`; `stats::lowess` is older, hard to use with `geom_smooth()`, and hard to match exactly with `stats::loess`; see https://support.bioconductor.org/p/2323/.
We can see a slight fan-shape here: observations on the right have larger variance (as expected since \(var(\bar y) = \pi(1-\pi)/n\) is maximized when \(\pi = 0.5\)).
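To see this numerically (a quick stand-alone check, not part of the WCGS analysis):

```{r}
# variance of a binomial proportion, pi * (1 - pi) / n, as a function of pi
pi_grid <- seq(0.01, 0.99, by = 0.01)
v <- pi_grid * (1 - pi_grid) / 25  # fixed n = 25 for illustration
pi_grid[which.max(v)]  # the variance peaks at pi = 0.5
```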
3.10.3 Pearson residuals
The fan-shape in the response residuals plot isn’t necessarily a concern here, since we haven’t made an assumption of constant residual variance, as we did for linear regression.
However, we might want to divide by the standard error in order to make the graph easier to interpret. Here’s one way to do that:
The Pearson (chi-squared) residual for covariate pattern \(k\) is: \[
\begin{aligned}
X_k &= \frac{\bar y_k - \hat\pi_k}{\sqrt{\hat \pi_k (1-\hat\pi_k)/n_k}}
\end{aligned}
\]
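You can verify this formula against R's built-in Pearson residuals; here is a sketch on simulated grouped data (not the WCGS data):

```{r}
# simulate grouped binomial data: n_k trials and y_k events per covariate pattern
set.seed(1)
dat <- data.frame(x = 1:5, n = c(20, 25, 30, 22, 18))
dat$y <- rbinom(n = 5, size = dat$n, prob = plogis(-1 + 0.3 * dat$x))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = dat)

pi_hat <- fitted(fit)   # fitted probabilities, pi-hat_k
ybar <- dat$y / dat$n   # observed proportions, ybar_k
X_k <- (ybar - pi_hat) / sqrt(pi_hat * (1 - pi_hat) / dat$n)

# matches R's built-in computation (returns TRUE):
all.equal(unname(X_k), unname(resid(fit, type = "pearson")))
```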
```{r}
chd_glm_strat <- glm(
  "formula" = chd69 == "Yes" ~ dibpat + dibpat:age - 1,
  "data" = wcgs,
  "family" = binomial(link = "logit"))
autoplot(chd_glm_strat, which = 1, ncol = 1) |> print()
```
This plot is meaningless: with individual-level binary outcomes, each residual can take only two values given its fitted probability, so the plot shows artifactual bands rather than anything about model fit.
Residual plots by hand (optional section)
If you want to check your understanding of what these residual plots are, try building them yourself:
```{r}
wcgs_grouped <- wcgs_grouped |>
  mutate(
    fitted = chd_glm_contrasts_grouped |> fitted(),
    fitted_logit = fitted |> logit(),
    resids = chd_glm_contrasts_grouped |> resid(type = "pearson"))

wcgs_resid_plot1 <- wcgs_grouped |>
  ggplot(
    mapping = aes(
      x = fitted_logit,
      y = resids)) +
  geom_point(aes(col = dibpat)) +
  geom_hline(yintercept = 0) +
  geom_smooth(
    se = FALSE,
    method.args = list(
      span = 2 / 3,
      degree = 1,
      family = "symmetric",
      iterations = 3,
      surface = "direct"),
    method = stats::loess)
# plot.glm and autoplot use stats::lowess, which is hard to use with
# geom_smooth and hard to match exactly;
# see https://support.bioconductor.org/p/2323/
```
For our CHD model, the p-value for this test is 0.265236; there is no significant evidence of lack of fit at the 0.05 level.
3.10.4 Standardized Pearson residuals
Especially for small data sets, we might want to adjust our residuals for leverage (since outliers in \(X\) add extra variance to the residuals):
\[r_{P_k} = \frac{X_k}{\sqrt{1-h_k}}\]
where \(h_k\) is the leverage of \(X_k\). The functions autoplot() and plot.lm() use these for some of their graphs.
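In R, `hatvalues()` extracts the leverages, and `rstandard(fit, type = "pearson")` performs the same adjustment; a sketch on simulated grouped data (not the WCGS data):

```{r}
# simulate grouped binomial data and fit a logistic model
set.seed(2)
dat <- data.frame(x = 1:6, n = rep(30, 6))
dat$y <- rbinom(n = 6, size = dat$n, prob = plogis(-1 + 0.2 * dat$x))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = dat)

h <- hatvalues(fit)                                # leverages h_k
r_P <- resid(fit, type = "pearson") / sqrt(1 - h)  # standardized by hand

# matches R's built-in computation (returns TRUE):
all.equal(unname(r_P), unname(rstandard(fit, type = "pearson")))
```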
3.10.5 Deviance residuals
For large sample sizes, the Pearson and deviance residuals will be approximately the same. However, deviance residuals from covariate patterns with few observations can be unreliable (high variance).
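A quick simulated check of this large-sample agreement (not the WCGS data):

```{r}
# with large group sizes n_k, Pearson and deviance residuals nearly coincide
set.seed(3)
dat <- data.frame(x = 1:6, n = rep(500, 6))
dat$y <- rbinom(n = 6, size = dat$n, prob = plogis(-2 + 0.4 * dat$x))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = dat)

cbind(pearson = resid(fit, type = "pearson"),
      deviance = resid(fit, type = "deviance"))
```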
Figure 3.23: Diagnostics for logistic model of BeetleMortality data
Hard to tell much from so little data, but there might be some issues here.
3.11 Objections to reporting odds ratios
Some scholars have raised objections to the use of odds ratios as an effect measure (Sackett, Deeks, and Altman 1996; Norton et al. 2024). One objection is that odds ratios are non-collapsible: the odds ratio for an exposure depends on which other covariates are included in the model, even when those covariates are independent of the exposure of interest and do not interact with that exposure. For example, consider the following model:
Since the \(\text{expit}\) function is nonlinear, we can’t exchange the order of the expectation operator and the \(\text{expit}\) transformation: in general, \(\text{E}\left[\text{expit}(X)\right] \neq \text{expit}\left(\text{E}[X]\right)\).
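A tiny numerical illustration of this non-exchangeability (with made-up linear-predictor values):

```{r}
expit <- function(eta) 1 / (1 + exp(-eta))  # inverse-logit
eta <- c(-1, 0, 3)  # hypothetical linear-predictor values
mean(expit(eta))    # E[expit(eta)]: average of the probabilities
expit(mean(eta))    # expit(E[eta]): probability at the average eta
# the two quantities differ, because expit() is nonlinear
```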
If you want risk ratios, you can sometimes get them by changing the link function:
```{r}
data(anthers, package = "dobson")
anthers_sum <- aggregate(
  anthers[c("n", "y")],
  by = anthers[c("storage")],
  FUN = sum)

anthers_glm_log <- glm(
  formula = cbind(y, n - y) ~ storage,
  data = anthers_sum,
  family = binomial(link = "log"))

anthers_glm_log |> parameters() |> print_md()
```
| Parameter   | Log-Risk | SE   | 95% CI         | z     | p      |
|-------------|----------|------|----------------|-------|--------|
| (Intercept) | -0.80    | 0.12 | (-1.04, -0.58) | -6.81 | < .001 |
| storage     | 0.17     | 0.07 | (0.02, 0.31)   | 2.31  | 0.021  |
Now \(\text{exp}\left\{\beta\right\}\) gives us risk ratios instead of odds ratios. For comparison, here is the same model fitted with the usual logit link:
```{r}
anthers_glm_logit <- glm(
  formula = cbind(y, n - y) ~ storage,
  data = anthers_sum,
  family = binomial(link = "logit"))

anthers_glm_logit |> parameters(exponentiate = TRUE) |> print_md()
```
| Parameter   | Odds Ratio | SE   | 95% CI       | z     | p     |
|-------------|------------|------|--------------|-------|-------|
| (Intercept) | 0.76       | 0.20 | (0.45, 1.27) | -1.05 | 0.296 |
| storage     | 1.49       | 0.26 | (1.06, 2.10) | 2.29  | 0.022 |
[to add: fitted plots on each outcome scale]
When I try to use `link = "log"` in practice, I often get errors about the estimation procedure failing to find good starting values. This is likely because the log link can produce fitted probabilities greater than 1.
When this happens, you can try to fit Poisson regression models instead (we will see those soon!). But then the outcome distribution isn’t quite right, and you won’t get warnings about fitted probabilities greater than 1. In my opinion, the Poisson model for binary outcomes is confusing and not very appealing.
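For illustration, here is a minimal sketch of that fallback on simulated binary data (the variable names are hypothetical; in practice, robust standard errors are usually recommended with this approach):

```{r}
# log-link Poisson regression applied to a binary outcome:
# exp(coef) is then interpretable on the risk-ratio scale
set.seed(42)
x <- rbinom(n = 200, size = 1, prob = 0.5)            # binary exposure
y <- rbinom(n = 200, size = 1, prob = 0.2 + 0.2 * x)  # binary outcome
fit_pois <- glm(y ~ x, family = poisson(link = "log"))
exp(coef(fit_pois))  # risk-ratio-scale estimates; no warning if fitted > 1
```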
Norton, Edward C., Bryan E. Dowd, Melissa M. Garrido, and Matthew L. Maciejewski. 2024. “Requiem for Odds Ratios.” Health Services Research 59 (4): e14337. https://doi.org/10.1111/1475-6773.14337.
Rosenman, Ray H, Richard J Brand, C David Jenkins, Meyer Friedman, Reuben Straus, and Moses Wurm. 1975. “Coronary Heart Disease in the Western Collaborative Group Study: Final Follow-up Experience of 8 1/2 Years.” JAMA 233 (8): 872–77. https://doi.org/10.1001/jama.1975.03260080034016.
Sackett, David L, Jonathan J Deeks, and Douglas G Altman. 1996. “Down with Odds Ratios!” BMJ Evidence-Based Medicine 1 (6): 164.
Vittinghoff, Eric, David V Glidden, Stephen C Shiboski, and Charles E McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2nd ed. Springer. https://doi.org/10.1007/978-1-4614-1353-0.
---
subtitle: "Logistic regression and variations"
---

# Models for Binary Outcomes {#sec-Bernoulli-models}

---

{{< include shared-config.qmd >}}

```{r}
options(digits = 6)
```

### Acknowledgements {.unnumbered}

This content is adapted from:

- @dobson4e, Chapter 7
- @vittinghoff2e, Chapter 5
- [David Rocke](https://dmrocke.ucdavis.edu/)'s materials from the [2021 edition of Epi 204](https://dmrocke.ucdavis.edu/Class/EPI204-Spring-2021/EPI204-Spring-2021.html)
- @NahhasIRMPHR [Chapter 6](https://www.bookdown.org/rwnahhas/RMPH/blr.html)

## Introduction

---

{{< include _sec_overview_bernoulli_models.qmd >}}

{{< include _sec_intro_bernoulli_models.qmd >}}

## Introduction to logistic regression

{{< include _sec_one_cov_logistic.qmd >}}

## Fitting logistic regression models with maximum likelihood estimation

{{< include _sec_logistic-fitting.qmd >}}

## Multiple logistic regression

{{< include _sec_exm_wcgs.qmd >}}

## Model comparisons for logistic models {#sec-gof}

{{< include _sec_logistic_gof.qmd >}}

## Residual-based diagnostics

{{< include _sec_logistic_dx.qmd >}}

{{< include _sec-non-logistic-bernoulli-models.qmd >}}

## Further reading

- @hosmer2013applied is a classic textbook on logistic regression